Note: This exercise is adapted from the original here. As of September 2020, installing pandas_profiling from conda might give you an old version (1.41), because some conda channels lag behind the latest release on PyPI (2.9.0 as of September 2020). To be clear about what was used, the exact environment and library versions for this exercise are listed in the Pipfile (see pipenv for more details) of this example here.

Pandas Profiling: NASA Meteorites example

Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below.

In [1]:
%load_ext autoreload
%autoreload 2

Make sure that we have the latest version of pandas-profiling.

In [2]:
# # uncomment and run below if you need to pip install the pandas-profiling library
# import sys
# !{sys.executable} -m pip install -U pandas-profiling==2.9.0
# !jupyter nbextension enable --py widgetsnbextension

You might want to restart the kernel now.
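After restarting, it can help to confirm which version is actually installed, since conda and pip may resolve to different releases. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is hypothetical and not part of pandas-profiling.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version found in the current environment (or None if missing).
print("pandas-profiling:", installed_version("pandas-profiling"))
```

If the printed version is older than the one you expect, the pip command in the cell above is one way to upgrade.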

Import libraries

conda install -c anaconda pandas-profiling

In [3]:
from pathlib import Path

import requests
import numpy as np
import pandas as pd

import pandas_profiling
from pandas_profiling.utils.cache import cache_file
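The `cache_file` helper imported above downloads a file once and reuses the local copy on later runs. The sketch below illustrates that general idea with the standard library only. It is not pandas-profiling's implementation, and the names `cache_download` and `data_dir` are assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import urlopen


def cache_download(file_name: str, url: str, data_dir: Path = Path("data")) -> Path:
    """Download `url` to `data_dir / file_name` unless that file already exists."""
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / file_name
    if not target.exists():
        # Only hit the network (or any other URL source) on the first call.
        with urlopen(url) as response:
            target.write_bytes(response.read())
    return target
```

Returning the local path, rather than the data itself, lets the caller pass it straight to a reader such as `pd.read_csv`.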

Load and prepare example dataset

We add some fake variables for illustrating pandas-profiling capabilities

In [6]:
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
    # 'https://data.nasa.gov/resource/gh4g-9sfh.csv',
)
print(file_name)
df = pd.read_csv(file_name)
    
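# Note: pandas nanosecond timestamps cannot represent dates before 1677-09-21 or after 2262-04-11.
# With errors='coerce', unparseable or out-of-range values become NaT, so they are ignored in this analysis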
df['year'] = pd.to_datetime(df['year'], errors='coerce')

# Example: Constant variable
df['source'] = "NASA"

# Example: Boolean variable
df['boolean'] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df['mixed'] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5,size=(len(df)))

# Example: Duplicate observations
duplicates_to_add = pd.DataFrame(df.iloc[0:10])
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"

# pd.concat replaces DataFrame.append, which was deprecated in pandas 1.4 and removed in 2.0
df = pd.concat([df, duplicates_to_add], ignore_index=True)
df
/Users/akiofukashima/miniforge3/envs/tf_m1/lib/python3.8/data/meteorites.csv
Out[6]:
      name                      id   nametype  recclass     mass      fall  year        reclat     reclong     geolocation           source  boolean  mixed  reclat_city
0     Aachen copy               1    Valid     L5           21.0      Fell  1880-01-01  50.77500   6.08333     (50.775, 6.08333)     NASA    True     1      42.143885
1     Aarhus copy               2    Valid     H6           720.0     Fell  1951-01-01  56.18333   10.23333    (56.18333, 10.23333)  NASA    True     1      58.301088
2     Abee copy                 6    Valid     EH4          107000.0  Fell  1952-01-01  54.21667   -113.00000  (54.21667, -113.0)    NASA    True     A      58.580998
3     Acapulco copy             10   Valid     Acapulcoite  1914.0    Fell  1976-01-01  16.88333   -99.90000   (16.88333, -99.9)     NASA    True     A      13.192585
4     Achiras copy              370  Valid     L6           780.0     Fell  1902-01-01  -33.16667  -64.95000   (-33.16667, -64.95)   NASA    True     A      -19.466973
...   ...                       ...  ...       ...          ...       ...   ...         ...        ...         ...                   ...     ...      ...    ...
1005  Adhi Kot copy             379  Valid     EH4          4239.0    Fell  1919-01-01  32.10000   71.80000    (32.1, 71.8)          NASA    False    A      33.885754
1006  Adzhi-Bogdo (stone) copy  390  Valid     LL3-6        910.0     Fell  1949-01-01  44.83333   95.16667    (44.83333, 95.16667)  NASA    False    A      48.545131
1007  Agen copy                 392  Valid     H5           30000.0   Fell  1814-01-01  44.21667   0.61667     (44.21667, 0.61667)   NASA    False    A      41.135277
1008  Aguada copy               398  Valid     L6           1620.0    Fell  1930-01-01  -31.60000  -65.23333   (-31.6, -65.23333)    NASA    True     1      -28.565801
1009  Aguila Blanca copy        417  Valid     L            1440.0    Fell  1920-01-01  -30.86667  -64.55000   (-30.86667, -64.55)   NASA    True     A      -28.675330

1010 rows × 14 columns
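The `pd.to_datetime(..., errors='coerce')` call in the cell above turns any year that pandas cannot represent into `NaT`. The representable range comes from storing timestamps as signed 64-bit nanosecond counts from 1970-01-01. The arithmetic can be checked with the standard library alone; this is a worked illustration, not pandas code, and the variable names are chosen for this sketch.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Largest signed 64-bit integer, interpreted as nanoseconds since the epoch.
MAX_NS = 2**63 - 1

# datetime only has microsecond resolution, so drop the last three digits.
span = timedelta(microseconds=MAX_NS // 1000)

earliest = EPOCH - span
latest = EPOCH + span

print(earliest.date())  # 1677-09-21
print(latest.date())    # 2262-04-11
```

This is consistent with the output above, where the Agen row keeps its 1814 date instead of becoming `NaT`.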

Inline report without saving object

In [7]:
report = df.profile_report(sort='None', html={'style':{'full_width': True}}, progress_bar=False)
report
Out[7]:

Save report to file

In [8]:
profile_report = df.profile_report(html={'style': {'full_width': True}})
profile_report.to_file("tmp/example.html")
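The path `tmp/example.html` assumes a `tmp` directory is available. Whether `to_file` creates missing parent directories is not confirmed here, so a defensive sketch is to create the directory first. The helper name `ensure_parent_dir` is hypothetical.

```python
from pathlib import Path


def ensure_parent_dir(output_file: str) -> Path:
    """Create the parent directory of `output_file` if needed and return the path."""
    path = Path(output_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Possible usage with the report object from the cell above:
# profile_report.to_file(ensure_parent_dir("tmp/example.html"))
```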

More analysis (Unicode) and Print existing ProfileReport object inline

In [9]:
profile_report = df.profile_report(explorative=True, html={'style': {'full_width': True}})
profile_report
Out[9]:

Notebook Widgets

In [10]:
profile_report.to_widgets()
IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)
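The message above names the setting that controls this limit. As a configuration fragment, and assuming a classic Jupyter Notebook server that reads `jupyter_notebook_config.py`, the limit could be raised as sketched below. The value 10000 is an arbitrary illustration, not a recommendation from the original notebook.

```python
# jupyter_notebook_config.py
# Jupyter supplies the `c` config object when it loads this file,
# so this fragment is not meant to be run as a standalone script.
c.NotebookApp.iopub_msg_rate_limit = 10000  # messages/sec; the value reported above is 1000
```

The same setting can also be passed on the command line when starting the server, using the `--NotebookApp.iopub_msg_rate_limit` flag quoted in the message.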

